library(dplyr)
library(ggplot2)

library(magrittr)



Getting started

The CPS_2018 data set contains wage data collected by the Current Population Survey (CPS) in 2018. Using these data, we can explore the relationship between annual wages and marital status (married or single) among 18-34 year olds. The original codebook is here. The data also reflect how the government collects employment data.

# Load the data
CPS_2018 <- read.csv("https://www.macalester.edu/~ajohns24/data/CPS_2018.csv")
  
# Let's just look at 18-34 year olds who make under $250,000
CPS_2018 <- CPS_2018 %>% 
  filter(age >= 18, age <= 34) %>% 
  filter(wage < 250000)


We’ll start with a small model of wage vs marital status:

# Visualize the relationship
ggplot(CPS_2018, aes(y = wage, x = marital)) + 
  geom_boxplot()

# Model the relationship
CPS_model_1 <- lm(wage ~ marital, CPS_2018)
summary(CPS_model_1)$coefficients
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)    46145.23    921.062  50.10002 0.000000e+00
## maritalsingle -17052.37   1127.177 -15.12839 5.636068e-50
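As a quick sanity check, the coefficients of a one-predictor categorical model are just group means in disguise; a sketch, assuming CPS_2018 is loaded as above:

```r
# The intercept should match the mean wage of married workers, and
# intercept + maritalsingle the mean wage of single workers
CPS_2018 %>%
  group_by(marital) %>%
  summarize(mean_wage = mean(wage))
```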





Exercise 1: Controlling for age

Let’s start simple, by controlling for age (an imperfect measure of experience) in our model of wages by marital status. NOTE: Pause to think about the coefficients before answering the questions below.

# Construct the model
CPS_model_2 <- lm(wage ~ marital + age, CPS_2018)
summary(CPS_model_2)$coefficients
##                 Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)   -19595.948  3691.6998 -5.308110 1.184066e-07
## maritalsingle  -7500.146  1191.8526 -6.292847 3.545964e-10
## age             2213.869   120.7701 18.331265 2.035782e-71
  a. Visualize this model by modifying the hint code in a new chunk. NOTE: Don’t get distracted by the last line. This is just making sure that the geom_smooth matches our model assumptions.
# HINT CODE: don't change
ggplot(___, aes(y = ___, x = ___, color = ___)) + 
  geom____(size = 0.1, alpha = 0.5) + 
  geom_line(aes(y = CPS_model_2$fitted.values), size = 1.25)
# Your solution
ggplot(CPS_2018, aes(y = wage, x = age, color = marital)) + 
  geom_jitter(size = 0.1, alpha = 0.5) + 
  geom_line(aes(y = CPS_model_2$fitted.values), size = 1.25)

  b. Suppose two workers are the same age, but one is married and the other is single. According to the model, by how much do we expect the single worker’s wage to differ from the married worker’s wage?

About \(\$7500\) less.

  c. With your answer to part b in mind, which of the options below is another way to interpret the maritalsingle coefficient? On average…
    • Single workers make $7500 less than married workers.
    • When controlling for (“holding constant”) age, single workers make $7500 less than married workers.
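One way to see this "same age, different marital status" comparison concretely is with predict(); a sketch, assuming CPS_model_2 from above and a hypothetical age of 25:

```r
# Two hypothetical 25-year-old workers who differ only in marital status
new_workers <- data.frame(marital = c("married", "single"), age = c(25, 25))
predict(CPS_model_2, newdata = new_workers)
# The two predictions differ by the maritalsingle coefficient, about -7500
```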



Exercise 2: Controlling for more covariates

We’ve seen that among all workers, the wage gap of single vs married workers is $17,052. We also saw that age accounts for some of this difference. Let’s see what happens when we control for even more workforce covariates: model wages by marital status while controlling for age and years of education:

CPS_model_3 <- lm(wage ~ marital + age + education, CPS_2018)
summary(CPS_model_3)$coefficients
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)   -64898.607  4099.8737 -15.829416 2.254709e-54
## maritalsingle  -6478.094  1119.9345  -5.784351 7.988760e-09
## age             1676.796   116.3086  14.416777 1.102113e-45
## education       4285.259   207.2158  20.680173 3.209448e-89
  a. Challenge: Construct a single visualization of the relationship among these 4 variables. Hot tip: Start by visualizing wage vs age, and step by step incorporate information about the other predictors.
ggplot(CPS_2018, aes(x = age, y = wage, color = marital, size = education)) +
  geom_point(alpha = 0.5) +
  geom_line(aes(y = CPS_model_3$fitted.values), size = 1.25)

  b. With so many variables, this is a tough model to visualize. If you had to draw it, how would the model trend appear: 1 point, 2 points, 2 lines, 1 plane, or 2 planes? Explain your rationale. Hot tip: pay attention to whether your predictors are quantitative or categorical.

2 planes, one for each category.

  c. Even if we can’t easily draw CPS_model_3, the coefficients contain the information we want! How can we interpret the education coefficient? On average…
    • Wages increase by $4285 for every extra year of education.
    • When controlling for marital status and age, wages increase by $4285 for every extra year of education.
    • People with an education make $4285 more than those that don’t.
  d. Which of the following is the best interpretation of the maritalsingle coefficient? On average…
    • When controlling for age and education, single workers make $6478 less than married workers.
    • Single workers make $6478 less than married workers.



Exercise 3: Even more covariates

Let’s control for another workforce covariate: model wages by marital status while controlling for age (quantitative), years of education (quantitative), and the industry in which one works (categorical):

CPS_model_4 <- lm(wage ~ marital + age + education + industry, CPS_2018)
summary(CPS_model_4)$coefficients
##                                   Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)                     -52498.857  7143.8481 -7.3488206 2.533275e-13
## maritalsingle                    -5892.842  1105.6898 -5.3295615 1.053631e-07
## age                               1493.360   116.1673 12.8552586 6.651441e-37
## education                         3911.117   243.0192 16.0938565 4.500408e-56
## industryconstruction              5659.082  6218.5649  0.9100302 3.628760e-01
## industryinstallation_production   1865.650  6109.2613  0.3053806 7.600964e-01
## industrymanagement                1476.884  6031.2901  0.2448704 8.065727e-01
## industryservice                  -7930.403  5945.6509 -1.3338158 1.823603e-01
## industrytransportation           -1084.176  6197.2462 -0.1749448 8.611342e-01
  a. Challenge: Construct a single visualization of the relationship among these 5 variables. Hot tip: utilize a facet_wrap(~ ___) à la Activity 5.
ggplot(CPS_2018, aes(x = age, y = wage, color = marital, size = education)) +
  geom_point(alpha = 0.5) +
  geom_line(aes(y = CPS_model_4$fitted.values), size = 1.25) +
  facet_wrap(~ industry)

  b. If you had to draw it, how would the model trend appear: “z” points, “z” lines, or “z” planes? Specify “z” and explain your rationale.

12 planes: the two quantitative predictors (age and education) define planes, and there is one plane for each combination of marital status (2 levels) and industry (6 levels), giving 2 × 6 = 12.

  c. When controlling for a worker’s age, marital status, and education, which industry tends to have the highest wages? The lowest?

Construction, because it has the highest industry coefficient; service has the lowest.

  d. Interpret the maritalsingle coefficient.

On average, after controlling for age, education, and industry, single workers make about $5,893 less than married workers.
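The industry comparison can also be read directly off predictions; a sketch, assuming CPS_model_4 from above and a hypothetical married 30-year-old with 16 years of education:

```r
# Predicted wages across three industries for otherwise-identical workers
new_data <- data.frame(marital = "married", age = 30, education = 16,
                       industry = c("construction", "management", "service"))
predict(CPS_model_4, newdata = new_data)
# construction should come out highest of these, and service lowest
```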



Exercise 4: An even bigger model

Consider two people with the same age, education, industry, typical number of work hours, and health status. However, one is single and one is married.

  a. Construct and use a new model to summarize the typical difference in the wages for these two people. Store this model as CPS_model_5.
CPS_model_5 <- lm(wage ~ marital + age + education + industry + hours + health, CPS_2018)
summary(CPS_model_5)$coefficients
##                                    Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)                     -64886.5747 6914.18198 -9.38456275 1.171028e-20
## maritalsingle                    -4992.7685 1061.84882 -4.70195794 2.687274e-06
## age                               1061.1410  115.83503  9.16079518 9.031462e-20
## education                         3443.7625  236.12723 14.58435151 1.128646e-46
## industryconstruction              5381.3857 5959.05620  0.90306007 3.665630e-01
## industryinstallation_production   2951.0372 5854.23981  0.50408547 6.142365e-01
## industrymanagement                5107.6364 5782.95334  0.88322283 3.771832e-01
## industryservice                  -3074.5127 5705.56537 -0.53886207 5.900201e-01
## industrytransportation            -207.3439 5940.02074 -0.03490626 9.721567e-01
## hours                              732.1340   43.72488 16.74410733 2.340115e-60
## healthfair                       -7407.7981 2901.71339 -2.55290483 1.072955e-02
## healthgood                       -2470.8096 1259.44276 -1.96182766 4.987035e-02
## healthpoor                       -9086.9110 7657.43781 -1.18667774 2.354441e-01
## healthvery_good                    292.5278 1020.89213  0.28654136 7.744823e-01
  b. After controlling for all of these workforce covariates, do you still find the remaining wage gap for single vs married people to be meaningfully “large”? Can you think of any remaining factors that might explain part of this remaining gap? Or do you think we’ve found evidence of inequitable wage practices for single vs married workers?

A gap of nearly $5,000 is still a meaningful difference, and few of the other coefficients are as strong. The remaining covariates in the data (sex, disability, and citizenship) may themselves be linked to marital status, which could account for part of the remaining gap. Location is not recorded in these data, and it may also have an effect.



Exercise 5: Model comparison & selection

Take two workers, one of which is married and the other of which is single. The models above provided the following insights into the typical difference in wages for these two groups:

Model Assume the two people have the same… Wage difference R2
CPS_model_1 NA -$17,052 0.067
CPS_model_2 age -$7,500 0.157
CPS_model_3 age, education -$6,478 0.257
CPS_model_4 age, education, industry -$5,893 0.280
CPS_model_5 age, education, industry, hours, health -$4,993 0.341
  a. Though this won’t always be the case in every analysis, notice that the marital coefficient got closer and closer to 0 as we included more covariates in our model. Explain the significance of this phenomenon in the context of our analysis - what does it mean?

It means that marital status is correlated with these covariates, i.e., the covariates account for part of the apparent wage gap. As they are added as predictors, the unique contribution of marital status to the model shrinks.

  b. We’ve built several models here – and there are dozens more that we could build but haven’t. Amongst so many options, it’s important to anchor our analysis and model building in our research goals. With respect to each possible goal below, explain which of our 5 models would be best.
    i. Our goal is to maximize our understanding of why wages vary from person to person. Use CPS_model_5: it includes the most covariates and has the highest R-squared.
    ii. Our goal is to understand the relationship between wages and marital status, ignoring all other possible factors in wages. Use CPS_model_1, which has marital status as the only predictor.
    iii. Our goal is to understand the relationship between wages and marital status while controlling for a person’s age and education. Use CPS_model_3, which models wages by only marital status, age, and education.
    iv. Our goal is to maximize our understanding of why wages vary from person to person while keeping our model as simple as possible. Use health, marital status, and industry, as they have the largest effects in the biggest model.
  c. Consider goal iv. Explain why you think simplicity can be a good model feature.

A simpler model requires less data collection, and it is easier to interpret.
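The comparison table’s numbers can be pulled programmatically; a sketch, assuming CPS_model_1 through CPS_model_5 from the earlier exercises are in the workspace:

```r
models <- list(CPS_model_1, CPS_model_2, CPS_model_3, CPS_model_4, CPS_model_5)
# The maritalsingle coefficient and R-squared for each model
sapply(models, function(m) coef(m)["maritalsingle"])
sapply(models, function(m) summary(m)$r.squared)
```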



Exercise 6: Data drill + curiosity

  a. We took a very narrow look at the wage data set. What other curiosities do you have about these data? Identify one new question of interest, and explore this question using the data. It can be a simple question and answered in 1 line / 1 set of lines of R code, so long as it’s not explored elsewhere in this activity.

I’m curious about the wage breakdown in different industries at different ages, regardless of marriage or education.

CPS_model_6 <- lm(wage ~ age + industry, CPS_2018)
summary(CPS_model_6)$coefficients
##                                    Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)                     -29472.6767  6815.3475 -4.32445693 1.576335e-05
## age                               2091.2841   108.7642 19.22769203 5.215931e-78
## industryconstruction              9060.5366  6491.1956  1.39581939 1.628669e-01
## industryinstallation_production   7606.3564  6369.8554  1.19411761 2.325215e-01
## industrymanagement               19140.2256  6197.1591  3.08854835 2.028833e-03
## industryservice                    494.1208  6184.6252  0.07989502 9.363258e-01
## industrytransportation            4466.3275  6462.1215  0.69115499 4.895189e-01
  b. Use dplyr tools to complete the data drill below. (There are some structural hints in the online manual version!)
# Load and use the complete CPS_2018 data
CPS_2018 <- read.csv("https://www.macalester.edu/~ajohns24/data/CPS_2018.csv")

# What are the mean and median wage?
CPS_2018 %>%
  summarize(mean(wage), median(wage))
##   mean(wage) median(wage)
## 1   51330.61        38000
# What are the mean and median wages in each industry?
CPS_2018 %>%
  group_by(industry) %>%
  summarize(mean(wage), median(wage))
## # A tibble: 6 x 3
##   industry                `mean(wage)` `median(wage)`
## * <chr>                          <dbl>          <dbl>
## 1 ag                            32115.          30000
## 2 construction                  44754.          36000
## 3 installation_production       46011.          38000
## 4 management                    70917.          55000
## 5 service                       34508.          25000
## 6 transportation                38147.          30000
# How many workers fall into each health group?
CPS_2018 %>%
  group_by(health) %>%
  count
## # A tibble: 5 x 2
## # Groups:   health [5]
##   health        n
##   <chr>     <int>
## 1 excellent  3162
## 2 fair        537
## 3 good       2610
## 4 poor         80
## 5 very_good  3611
# Obtain a dataset of workers aged 18 to 22 that are in good health
# Show just the first 6 rows
# Hint: <= is "less than or equal to"
CPS_2018 %>%
  filter(age >= 18, age <= 22) %>%
  filter(health == "good") %>%
  head(6)
# In one set of lines, calculate the median age (in years) amongst workers in excellent health
# See online manual for structural hint
CPS_2018 %>%
  filter(health == "excellent") %>%
  summarize(median(age))
##   median(age)
## 1          38
# In one set of lines, calculate the median age (in MONTHS) amongst workers in excellent health
# See online manual for structural hint
CPS_2018 %>%
  filter(health == "excellent") %>%
  summarize(median(age * 12))
##   median(age * 12)
## 1              456





# Load data
library(palmerpenguins)
data(penguins)

(Art by @allison_horst)

Exercise 7: Getting started

  a. The flipper_length_mm variable currently measures flipper length in mm. Create a new variable named flipper_length_cm which measures flipper length in cm. Store this inside the penguins data. Hints:
    • Make sure you’ve defined the variable correctly before storing.
    • There are 10 mm in a cm.
penguins %<>% mutate(flipper_length_cm = flipper_length_mm / 10)
  b. Run the code chunk below to build a bunch of models that you’ll be exploring in the exercises:
penguin_model_1 <- lm(bill_length_mm ~ flipper_length_mm, penguins)
penguin_model_2 <- lm(bill_length_mm ~ flipper_length_cm, penguins)
penguin_model_3 <- lm(bill_length_mm ~ flipper_length_mm + flipper_length_cm, penguins)
penguin_model_4 <- lm(bill_length_mm ~ body_mass_g, penguins)
penguin_model_5 <- lm(bill_length_mm ~ flipper_length_mm + body_mass_g, penguins)



Exercise 8: Modeling bill length by flipper length

What can a penguin’s flipper (arm) length tell us about their bill length? To answer this question, we’ll consider 3 of our models:

model predictors
penguin_model_1 flipper_length_mm
penguin_model_2 flipper_length_cm
penguin_model_3 flipper_length_mm + flipper_length_cm
  a. Create and summarize two separate plots: bill_length_mm vs flipper_length_mm; and bill_length_mm vs flipper_length_cm.
point_and_line <- function(x){ x +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)}

penguins %>% 
  ggplot(aes(y = bill_length_mm, x = flipper_length_mm)) %>% 
  point_and_line
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

summary(penguin_model_1)$coefficients
##                     Estimate Std. Error  t value     Pr(>|t|)
## (Intercept)       -7.2648678 3.20015684 -2.27016 2.382330e-02
## flipper_length_mm  0.2547682 0.01588914 16.03410 1.743974e-43
penguins %>% 
  ggplot(aes(y = bill_length_mm, x = flipper_length_cm)) %>% 
  point_and_line
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).

## Warning: Removed 2 rows containing missing values (geom_point).

summary(penguin_model_2)$coefficients
##                    Estimate Std. Error  t value     Pr(>|t|)
## (Intercept)       -7.264868  3.2001568 -2.27016 2.382330e-02
## flipper_length_cm  2.547682  0.1588914 16.03410 1.743974e-43
  b. Before examining the model summaries, ask your gut:
    • Do you think the penguin_model_2 R-squared will be less than, equal to, or more than that of penguin_model_1?
    • Similarly, how do you think the penguin_model_3 R-squared will compare to that of penguin_model_1?

All three R-squared values are the same because they all reflect the same correlation. Multiplying a predictor by a constant doesn’t change the fit of the line.

  c. Check your gut: Report the R-squared values for the three penguin models and summarize how these compare.

All the same!

summary(penguin_model_1)$r.squared
## [1] 0.430574
summary(penguin_model_2)$r.squared
## [1] 0.430574
summary(penguin_model_3)$r.squared
## [1] 0.430574
  d. Explain why your observation in part c makes sense. Support your reasoning with a plot of the 2 predictors: flipper_length_mm vs flipper_length_cm.

The two variables contain exactly the same information, so each produces the same fit, and including both doesn’t give the model any new information to work with.

penguins %>%
  ggplot(aes(x = flipper_length_mm, y = flipper_length_cm)) %>%
  point_and_line
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

  e. Challenge: In summary(penguin_model_3), the flipper_length_cm coefficient is NA. Explain why this makes sense. HINT: Thinking about yesterday’s activity, why wouldn’t it make sense to interpret this coefficient? BONUS: For those of you that have taken MATH 236, this has to do with matrices that are not of full rank!

Because flipper_length_cm is just a scaled duplicate of flipper_length_mm, it contributes nothing new to the model, so its coefficient is not applicable (NA) in this context.
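This invariance is easy to verify on simulated data; a standalone sketch (not the penguins data):

```r
# Rescaling a predictor rescales its coefficient but leaves R-squared alone
set.seed(1)
x_mm <- rnorm(100, mean = 200, sd = 15)
y <- 10 + 0.25 * x_mm + rnorm(100, sd = 3)
x_cm <- x_mm / 10

r2_mm <- summary(lm(y ~ x_mm))$r.squared
r2_cm <- summary(lm(y ~ x_cm))$r.squared
all.equal(r2_mm, r2_cm)  # TRUE: identical fit, identical R-squared
```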



Exercise 9: Incorporating body_mass_g

In this exercise you’ll consider 3 models of bill_length_mm:

model predictors
penguin_model_1 flipper_length_mm
penguin_model_4 body_mass_g
penguin_model_5 flipper_length_mm + body_mass_g
  a. Which is the better predictor of bill_length_mm: flipper_length_mm or body_mass_g? Provide some numerical evidence.
summary(penguin_model_1)$r.squared
## [1] 0.430574
summary(penguin_model_4)$r.squared
## [1] 0.3541557

Flipper length seems to be a better predictor of bill length than body mass: the R-squared of penguin_model_1 is higher than that of penguin_model_4.

  b. penguin_model_5 incorporates both flipper_length_mm and body_mass_g as predictors. Before examining a model summary, ask your gut: Will the penguin_model_5 R-squared be close to 0.35, close to 0.43, or greater than 0.6?
summary(penguin_model_5)$r.squared
## [1] 0.4328544

At 0.433, this is only slightly better than penguin_model_1 (0.431).

  c. Check your intuition. Report the penguin_model_5 R-squared and summarize how this compares to that of penguin_model_1 and penguin_model_4.

The penguin_model_5 R-squared (0.433) is barely higher than that of penguin_model_1 (0.431), and both are well above that of penguin_model_4 (0.354).

  d. Explain why your observation in part c makes sense. Support your reasoning with a plot of the 2 predictors: flipper_length_mm vs body_mass_g.

penguins %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).

Body mass and flipper length are highly correlated, so body mass doesn’t add much information beyond what flipper length already provides.
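We can quantify that association directly; a sketch, assuming the penguins data is loaded (the correlation comes out around 0.87):

```r
# Correlation between the two predictors, dropping rows with missing values
cor(penguins$flipper_length_mm, penguins$body_mass_g, use = "complete.obs")
```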



Exercise 10: Redundancy & Multicollinearity

The exercises above have illustrated special phenomena in multivariate modeling:

  • two predictors are redundant if they contain the same exact information

  • two predictors are multicollinear if they are strongly associated (they contain very similar information) but are not completely redundant.

  a. Which penguin model had redundant predictors and which predictors were these?

penguin_model_3: flipper_length_mm and flipper_length_cm.

  b. Which penguin model had multicollinear predictors and which predictors were these?

penguin_model_5: flipper_length_mm and body_mass_g.

  c. In general, what happens to the R-squared value if we add a redundant predictor into a model: will it decrease, stay the same, increase by a small amount, or increase by a significant amount?

It stays the same.

  d. Similarly, what happens to the R-squared value if we add a multicollinear predictor into a model?

It increases by a small amount.
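The redundancy behavior can be checked on simulated data; a standalone sketch:

```r
# A redundant copy of a predictor yields an NA coefficient and an
# unchanged R-squared
set.seed(3)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)

fit_a <- lm(y ~ x)
fit_b <- lm(y ~ x + I(x / 10))  # I(x / 10) is perfectly redundant with x

coef(fit_b)                     # the redundant term's coefficient is NA
c(summary(fit_a)$r.squared, summary(fit_b)$r.squared)  # identical
```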



Exercise 11: Overfitting

Not only does the strategy of adding more and more and more predictors complicate our models and have diminishing returns, it can give us silly results. Construct 2 models using a sample of 10 penguins (for illustration purposes only). NOTE: If you’re using an older version of RStudio, you might get a different sample from others but the ideas are the same!

# Take a sample of 10 penguins
# We'll discuss this code later in the course!
set.seed(155)
penguins_small <- sample_n(penguins, size = 10) %>% 
  mutate(flipper_length_mm = jitter(flipper_length_mm))

# 2 models
poly_mod_1 <- lm(bill_length_mm ~ flipper_length_mm, penguins_small)
poly_mod_2 <- lm(bill_length_mm ~ poly(flipper_length_mm, 2), penguins_small)

# 2 R-squared
summary(poly_mod_1)$r.squared
## [1] 0.7341412
summary(poly_mod_2)$r.squared
## [1] 0.7630516
# Plot the 2 models
ggplot(penguins_small, aes(y = bill_length_mm, x = flipper_length_mm)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

ggplot(penguins_small, aes(y = bill_length_mm, x = flipper_length_mm)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)


  a. Adding a quadratic term to the model increased R-squared quite a bit. BUT! I! WANT! MORE! Adapt the code above to construct a model of bill_length_mm by poly(flipper_length_mm, 9) which has 9 polynomial predictors: flipper_length_mm, flipper_length_mm^2,…, flipper_length_mm^9.
poly_mod_9 <- lm(bill_length_mm ~ poly(flipper_length_mm, 9), penguins_small)
  b. Report the model’s R-squared and construct a visualization of this model (show the model trend on top of the scatterplot of raw data).
summary(poly_mod_9)$r.squared
## [1] 1
ggplot(penguins_small, aes(y = bill_length_mm, x = flipper_length_mm)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 9), se = FALSE)

  c. OK. We’ve learned that it’s always possible to get a perfect R-squared of 1. However, our example here demonstrates the drawbacks of overfitting a model to our sample data. Comment on the following:
    • How easy is it to interpret this model? Really hard.
    • Would you say that this model captures the general trend of the relationship between bill_length_mm and flipper_length_mm? Probably not, there are weird dips and peaks.
    • How well do you think this model would generalize to the penguins that were not included in the penguins_small sample? Poorly. High-degree polynomials tend to blow up quickly away from the observed data.
  d. Zooming out, explain some limitations of relying on R-squared to measure the strength / usefulness of a model.

It is possible to maximize R-squared at the expense of the model’s usefulness.

  e. Check out the image from the front page of the manual. Which plot pokes fun at overfitting?

The panel for “Connecting Lines” (“I clicked ‘smooth lines’ in Excel”).
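The generalization problem above can be illustrated with simulated data; a standalone sketch (the penguins aren’t needed):

```r
# Fit a line and a degree-9 polynomial to 10 training points, then compare
# their mean squared error on 100 fresh points from the same process
set.seed(2)
x_train <- runif(10, 0, 10)
y_train <- 2 + 0.5 * x_train + rnorm(10)
x_test <- runif(100, 0, 10)
y_test <- 2 + 0.5 * x_test + rnorm(100)

fit_line <- lm(y_train ~ x_train)
fit_poly <- lm(y_train ~ poly(x_train, 9))

# test-set mean squared error for a fitted model
mse <- function(fit) {
  mean((y_test - predict(fit, newdata = data.frame(x_train = x_test)))^2)
}

summary(fit_poly)$r.squared  # essentially 1: a perfect fit to the sample
c(line = mse(fit_line), poly9 = mse(fit_poly))  # poly9 is typically far worse
```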



Exercise 12: Model selection & curiosity

  a. There are so many models we could build! For each possible research goal, indicate which predictors you’d include in your model.
    i. We want to understand the relationship between bill length and depth when controlling for penguin species. I’d include bill_depth_mm and species.
    ii. We want to be able to predict a penguin’s bill length from its flipper length alone (because maybe penguins let us get closer to their arms than their nose?). Flipper length only.

  b. Aren’t you so curious about penguins? Identify one new question of interest, and explore this question using the data. It can be a simple question and answered in 1 line / 1 set of lines / 1 plot of R code, so long as it’s not explored elsewhere in this activity.

It’s easier to weigh a penguin than to measure one, so: how well does body mass predict flipper length?

ggplot(penguins, aes(y = flipper_length_mm, x = body_mass_g)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)